Experiments in Farsi Text Retrieval

Authors

  • FARHAD OROUMCHIAN
  • NAGHMEH KARIMI
  • MINA ZOLFY
Abstract

A series of experiments is being conducted on Farsi-language text in the domain of law at the University of Tehran. One goal of these experiments is to establish the performance of different weighting schemes and retrieval models. Because no Farsi stemmer was available, and because of some characteristics of the language, it was decided to experiment with N-grams. Using unstemmed words, 2-grams, 3-grams and 4-grams with three weighting schemes (dnb.dtn, lnc.ltc and tfc.nfx), 12 experiments were conducted, and the top 20 retrieved documents were judged. The 3-grams appear to perform as well as unstemmed words under the lnc.ltc and tfc.nfx weights. However, dnb.dtn outperforms every N-gram weighting scheme. Two linear combinations of 3-grams and unstemmed words, and one linear combination of 3-grams, 4-grams and unstemmed words, were also tried in the hope that each method would capture different aspects of the language. However, the results did not show any improvement.
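The two ideas in the abstract, indexing by character N-grams instead of stems and linearly combining the scores of several runs, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `char_ngrams`, `index_terms` and `combine_scores` are hypothetical, and the choices to build N-grams inside word boundaries and to keep words shorter than N whole are assumptions not stated in the abstract.

```python
def char_ngrams(word, n):
    """Return the overlapping character n-grams of a single word.

    Assumption: words no longer than n are kept whole.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text, n=None):
    """Turn text into index terms.

    n=None keeps the unstemmed words; otherwise each whitespace-separated
    word is replaced by its character n-grams (assumed to stay within
    word boundaries).
    """
    words = text.split()
    if n is None:
        return words
    terms = []
    for word in words:
        terms.extend(char_ngrams(word, n))
    return terms


def combine_scores(score_maps, weights):
    """Linearly combine per-document scores from several retrieval runs.

    score_maps: list of {doc_id: score} dicts, one per run.
    weights: one coefficient per run (illustrative values only).
    """
    combined = {}
    for scores, weight in zip(score_maps, weights):
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return combined


if __name__ == "__main__":
    print(index_terms("قانون مدنی", n=3))
    word_run = {"d1": 1.0, "d2": 0.5}      # made-up scores for illustration
    trigram_run = {"d1": 0.2}
    print(combine_scores([word_run, trigram_run], [0.5, 0.5]))
```

Python strings are sequences of Unicode code points, so the same slicing works on Farsi and Latin text alike. A document missing from one run simply contributes nothing from that run to its combined score.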


Similar resources

Farsi/Arabic Document Image Retrieval through Sub-Letter Shape Coding for Mixed Farsi/Arabic and English Text

A retrieval method for Farsi/Arabic documents that requires no explicit recognition is proposed in this paper. The system can be used on mixed Farsi/Arabic and English text. The method consists of preprocessing, word and sub-word extraction, detection and cancellation of sub-letter connectors, annotation of sub-letters by shape coding, classification of sub-letters using a decision tree, and use of an RBF neural ne...


Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: The current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: Selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features were apportioned to each group of images separately. In this regard, each group image surr...


Determining the Boundaries and Types of Syntactic Phrases in Farsi Texts

Text tokenization is the process of splitting text into meaningful tokens such as words, phrases, sentences, etc. Tokenization of syntactic phrases, known as chunking, is an important preprocessing step needed in many applications such as machine translation, information retrieval, text to speech, etc. In this paper chunking of Farsi texts is done using statistical and learning methods and the grammat...


Assessment of a Modern Farsi Corpus

The development of Language Engineering (LE) and Information Retrieval (IR) applications requires the availability of sizeable, reliable and representative corpora. This paper describes how we have constructed a well-structured 345 MB tagged corpus of news, and presents some useful statistics of this corpus based upon the characteristics of the Farsi language. It also goes into particular detail on...


Improving K-Nearest Neighbor Efficacy for Farsi Text Classification

One of the common processes in the field of text mining is text classification. Because of the complex nature of the Farsi language, with its words made of separate parts and its compound verbs, most text classification systems are not applicable to Farsi texts. K-Nearest Neighbors (KNN) is one of the most popular methods for text classification and presents good performance in experiments on different d...




Publication date: 2001